bpo-44895: Temporarily add an extra gc.collect() call #27746

Merged: 1 commit, Aug 13, 2021

Conversation

iritkatriel
Member

@iritkatriel iritkatriel commented Aug 12, 2021

This is part of an investigation of a non-deterministic reference leak. While we're looking for the root cause, this is included temporarily so that CI doesn't fail on this particular issue. This enables it to find other regressions in the meantime, which would otherwise be shadowed by our known issue.

https://bugs.python.org/issue44895

@iritkatriel
Member Author

Let's run this on buildbots and if it indeed resolves the leak we can do this for now instead of skipping the test.

@iritkatriel iritkatriel added the 🔨 test-with-buildbots and skip news labels Aug 12, 2021
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @iritkatriel for commit a6a8b1c 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots label Aug 12, 2021
@iritkatriel iritkatriel changed the title from "bpo-44895: temporarily add a gc call to unbreak the built while we in…" to "bpo-44895: temporarily add a gc call to unbreak the build while we in…" Aug 12, 2021
@iritkatriel
Member Author

@vstinner The buildbot tests passed. This might quiet down the CI. What do you think?

@ambv ambv added the needs backport to 3.10 label Aug 13, 2021
@ambv ambv merged commit 7bf28cb into python:main Aug 13, 2021
@miss-islington
Contributor

Thanks @iritkatriel for the PR, and @ambv for merging it 🌮🎉. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

@bedevere-bot

GH-27753 is a backport of this pull request to the 3.10 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Aug 13, 2021
This is part of an investigation of a non-deterministic reference leak. While we're looking for the root cause, this is included temporarily so that CI doesn't fail on this particular issue. This enables it to find other regressions in the meantime, which would otherwise be shadowed by our known issue.
(cherry picked from commit 7bf28cb)

Co-authored-by: Irit Katriel <[email protected]>
@ambv ambv changed the title from "bpo-44895: temporarily add a gc call to unbreak the build while we in…" to "bpo-44895: Temporarily add an extra gc.collect() call" Aug 13, 2021
miss-islington added a commit that referenced this pull request Aug 13, 2021
This is part of an investigation of a non-deterministic reference leak. While we're looking for the root cause, this is included temporarily so that CI doesn't fail on this particular issue. This enables it to find other regressions in the meantime, which would otherwise be shadowed by our known issue.
(cherry picked from commit 7bf28cb)

Co-authored-by: Irit Katriel <[email protected]>
@@ -1014,6 +1014,9 @@ def cycle():

    def test_no_hang_on_context_chain_cycle2(self):
        # See issue 25782. Cycle at head of context chain.
Member

This fix looks incorrect: tests should not depend on the GC to pass. When this happens, it is a symptom of another problem.

I propose to revert this commit and investigate.
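
To illustrate why a test can end up "depending on the GC", here is a minimal sketch, assuming CPython's reference counting plus cyclic collector. It links two exceptions through `__context__` and observes one of them through a weak reference. The class names `A` and `C` and the helper `make_context_cycle` are hypothetical and are not taken from the test in this PR.

```python
import gc
import weakref


class A(Exception):
    pass


class C(Exception):
    pass


def make_context_cycle():
    """Link two exceptions into a __context__ cycle; return a weak reference."""
    a, c = A(), C()
    a.__context__ = c
    c.__context__ = a
    return weakref.ref(a)


gc.disable()  # keep the automatic collector from running mid-demo
try:
    ref = make_context_cycle()
    # The cycle keeps both reference counts above zero after the function returns.
    alive_before = ref() is not None
    gc.collect()
    # Only the cyclic collector can reclaim the two exceptions.
    alive_after = ref() is not None
finally:
    gc.enable()

print(alive_before, alive_after)
```

Objects held in such a cycle are released only when the collector happens to run, which is one way a reference count measured right after a test can vary between runs.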

Member

CC: @vstinner

Member

I understand it is a "temporary measure", but in my experience those are left there with no fixes more often than not. Also, I don't feel comfortable with "temporary" fixes in the release candidate.

Member Author

The alternative is to disable the test. That doesn't fix the issue either.

Member

I prefer to deactivate the test. The reason is that relying on the GC in this way at the end has global effects and can mask other issues. It is also not deterministic and can actually be an endless loop in some extreme situations involving resurrection.

This is just my opinion on this, of course. If the consensus is to leave this because the test has more value, then let's leave it, but I have to say that my previous experience with this kind of fix is that it is left there more often than not.
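
As a hedged illustration of the "endless loop" concern, the sketch below shows a related effect rather than literal resurrection: a finalizer that creates fresh cyclic garbage each time it runs, so a "collect until nothing is found" loop never settles on its own. The class `Phoenix` and the helper `collect_until_clean` are hypothetical names invented for this example, and the loop is bounded so the demo terminates.

```python
import gc


class Phoenix:
    """Each finalized instance leaves a new cyclic instance behind."""

    active = True

    def __init__(self):
        self.me = self  # self-cycle: only the cyclic collector can free it

    def __del__(self):
        if Phoenix.active:
            Phoenix()  # new garbage for the next collection pass


def collect_until_clean(max_passes):
    """Call gc.collect() until it finds nothing, giving up after max_passes."""
    for passes in range(1, max_passes + 1):
        if gc.collect() == 0:
            return passes, True
    return max_passes, False


gc.collect()  # clear unrelated garbage first
Phoenix()
passes, settled = collect_until_clean(10)

# Stop the chain so the demo leaves nothing behind.
Phoenix.active = False
gc.collect()

print(passes, settled)
```

Without the `max_passes` bound, the loop would keep finding one new unreachable object per pass for as long as `Phoenix.active` stays true.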

Member

Don't worry, it is not urgent.

Thanks a lot for the investigation and for all the work!! 🚀

Contributor

I don't feel comfortable with "temporary" fixes in the release candidate.

Sure, we weren't going to let this slip into RC2. The point was to make refleak tests able to catch other regressions on that branch in the meantime.

Contributor

If you'd rather redo the fix as a skip instead of gc.collect() then that's fine as well. However, from what I understood on the PR, having it run on the entire buildbot fleet for a few days would give us more confidence whether that approach to working around the refleaks is even effective.

How about we leave it as is for the weekend and remove the gc.collect() loop on Monday?

Member

If you'd rather redo the fix as a skip instead of gc.collect() then that's fine as well. However, from what I understood on the PR, having it run on the entire buildbot fleet for a few days would give us more confidence whether that approach to working around the refleaks is even effective.

I don't get what you mean by this. Why do we want to know if the approach to work around is effective? What information do we gain by this? I can understand the thought that this may shed some more light on the problem, but this workaround is too intrusive to draw any conclusions about the actual problem, other than that a cycle is likely involved.

How about we leave it as is for the weekend and remove the gc.collect() loop on Monday?

👍 Works for me

Contributor

Why do we want to know if the approach to work around is effective?

Irit wrote:

Let's run this on buildbots and if it indeed resolves the leak we can do this for now.

I misinterpreted this as "let's merge this and see", but obviously she meant the test-with-buildbots label. Never mind!

@vstinner
Member

@vstinner The buildbot tests passed. This might quiet down the CI. What do you think?

I'm fine with the workaround to unblock buildbots, but https://bugs.python.org/issue44895 must only be closed when the root issue is identified.

The regrtest test runner already runs gc.collect(), and regrtest -R 3:3 runs gc.collect() one more time. So it's strange that you have to add a third gc.collect() call.

The worst case that I saw was a bug in a type implemented in C (https://vstinner.github.io/subinterpreter-leaks.html). Calling gc.collect() worked around this bug, but I had to fix the C type (_thread.Lock) to fix the root issue. I don't think that it's the same bug here, since that leak was only seen when an interpreter was destroyed. Here the leak is seen at each loop.
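
For readers unfamiliar with leak hunting, the following is a rough sketch of the general idea behind a run such as `-R 3:3` (a few warm-up runs, then a few measured runs with a collection before each measurement). It is not regrtest's actual implementation: `hunt_leaks`, `Tracked`, `leaky` and `cyclic` are hypothetical names, and a simple live-instance counter stands in for the interpreter-level counters a real refleak run compares.

```python
import gc


def hunt_leaks(func, measure, warmups=3, runs=3):
    """Return the change in measure() across each measured run of func."""
    for _ in range(warmups):
        func()  # let caches and lazy initialisation settle
    gc.collect()
    before = measure()
    deltas = []
    for _ in range(runs):
        func()
        gc.collect()  # collect cycles so only real leaks remain
        after = measure()
        deltas.append(after - before)
        before = after
    return deltas


class Tracked:
    """Counts its own live instances (a stand-in for a refcount total)."""

    live = 0

    def __init__(self):
        Tracked.live += 1

    def __del__(self):
        Tracked.live -= 1


_cache = []


def leaky():
    _cache.append(Tracked())  # a genuine leak: one more object per call


def cyclic():
    t = Tracked()
    t.me = t  # not a leak, but only freed once gc.collect() runs


deltas_leaky = hunt_leaks(leaky, lambda: Tracked.live)
deltas_cyclic = hunt_leaks(cyclic, lambda: Tracked.live)
print(deltas_leaky, deltas_cyclic)
```

In this sketch the collection inside the harness is enough to bring the cyclic case back to a zero delta, which is why needing yet another explicit gc.collect() inside the test itself stands out.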

Labels: skip news, tests
7 participants